32 research outputs found

    Identification and evolutionary analysis of novel exons and alternative splicing events using cross-species EST-to-genome comparisons in human, mouse and rat

    Get PDF
    BACKGROUND: Alternative splicing (AS) is important for evolution and major biological functions in complex organisms. However, the extent of AS in mammals other than human and mouse is largely unknown, making it difficult to study AS evolution in mammals and its biomedical implications. RESULTS: Here we describe a cross-species EST-to-genome comparison algorithm (ENACE) that can identify novel exons for EST-scanty species and distinguish conserved and lineage-specific exons. The identified exons represent not only novel exons but also evolutionarily meaningful AS events that are not previously annotated. A genome-wide AS analysis in human, mouse and rat using ENACE reveals a total of 758 novel cassette-on exons and 167 novel retained introns that have no EST evidence from the same species. RT-PCR-sequencing experiments validated ~50 ~80% of the tested exons, indicating high presence of exons predicted by ENACE. ENACE is particularly powerful when applied to closely related species. In addition, our analysis shows that the ENACE-identified AS exons tend not to pass the nonsynonymous-to-synonymous substitution ratio test and not to contain protein domain, implying that such exons may be under positive selection or relaxed negative selection. These AS exons may contribute to considerable inter-species functional divergence. Our analysis further indicates that a large number of exons may have been gained or lost during mammalian evolution. Moreover, a functional analysis shows that inter-species divergence of AS events may be substantial in protein carriers and receptor proteins in mammals. These exons may be of interest to studies of AS evolution. The ENACE programs and sequences of the ENACE-identified AS events are available for download. CONCLUSION: ENACE can identify potential novel cassette exons and retained introns between closely related species using a comparative approach. It can also provide information regarding lineage- or species-specificity in transcript isoforms, which are important for evolutionary and functional studies

    INDELSCAN: a web server for comparative identification of species-specific and non-species-specific insertion/deletion events

    Get PDF
    Insertion and deletion (indel) events usually have dramatic effects on genome structure and gene function. Species-specific indels have been demonstrated to be associated with species-unique traits. Currently, indel identifications mainly rely on pair-wise sequence alignments (the ‘pair-wise indels’), which suffer lack of discrimination of species specificity and insertion versus deletion. Also, there is no freely accessible web server for genome-wide identification of indels. Therefore, we develop a web server—INDELSCAN— to identify four types of indels using multiple sequence alignments that include sequences from one target, one subject and ≥1 out-group species. The four types of indels identified encompass target species-specific, subject species-specific, non-species-specific and target-subject pair-wise indels. Insertions and deletions are discriminated with reference to out-group sequences. The genomic locations (5′UTR, intron, CDS, 3′UTR and intergenic region) of these indels are also provided for functional analysis. INDELSCAN provides genomic sequences and gene annotations from a wide spectrum of taxa for users to select from, including nine target species (human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), dog (Canis familiaris), opossum (Monodelphis domestica), chicken (Gallus gallus), zebrafish (Danio rerio), fly (Drosophila melanogaster) and yeast (Saccharomyces cerevisiae) and >35 subject/out-group species, ranging from yeasts to mammals. The server also provides analytic figures and supports indel identification from user-uploaded alignments/annotations. INDELSCAN is freely accessible at http://indelscan.genomics.sinica.edu.tw/IndelScan/

    CAPIH: A Web interface for comparative analyses and visualization of host-HIV protein-protein interactions

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The Human Immunodeficiency Virus type one (HIV-1) is the major causing pathogen of the Acquired Immune Deficiency Syndrome (AIDS). A large number of HIV-1-related studies are based on three non-human model animals: chimpanzee, rhesus macaque, and mouse. However, the differences in host-HIV-1 interactions between human and these model organisms have remained unexplored.</p> <p>Description</p> <p>Here we present CAPIH (Comparative Analysis of Protein Interactions for HIV-1), the first web-based interface to provide comparative information between human and the three model organisms in the context of host-HIV-1 protein interactions. CAPIH identifies genetic changes that occur in HIV-1-interacting host proteins. In a total of 1,370 orthologous protein sets, CAPIH identifies ~86,000 amino acid substitutions, ~21,000 insertions/deletions, and ~33,000 potential post-translational modifications that occur only in one of the four compared species. CAPIH also provides an interactive interface to display the host-HIV-1 protein interaction networks, the presence/absence of orthologous proteins in the model organisms in the networks, the genetic changes that occur in the protein nodes, and the functional domains and potential protein interaction hot sites that may be affected by the genetic changes. The CAPIH interface is freely accessible at <url>http://bioinfo-dbb.nhri.org.tw/capih</url>.</p> <p>Conclusion</p> <p>CAPIH exemplifies that large divergences exist in disease-associated proteins between human and the model animals. Since all of the newly developed medications must be tested in model animals before entering clinical trials, it is advisable that comparative analyses be performed to ensure proper translations of animal-based studies. In the case of AIDS, the host-HIV-1 protein interactions apparently have differed to a great extent among the compared species. An integrated protein network comparison among the four species will probably shed new lights on AIDS studies.</p

    Scanning for the Signatures of Positive Selection for Human-Specific Insertions and Deletions

    Get PDF
    Human-specific small insertions and deletions (HS indels, with lengths <100 bp) are reported to be ubiquitous in the human genome. However, whether these indels contribute to human-specific traits remains unclear. Here we employ a modified McDonald–Kreitman (MK) test and a combinatorial population genetics approach to infer, respectively, the occurrence of positive selection and recent selective sweep events associated with HS indels. We first extract 625,890 HS indels from the human–chimpanzee–macaque–mouse multiple alignments and classify them into nonpolymorphic (41%) and polymorphic (59%) indels with reference to the human indel polymorphism data. The modified MK test is then applied to 100-kb partially overlapped sliding windows across the human genome to scan for the signs of positive selection. After excluding the possibility of biased gene conversion and controlling for false discovery rate, we show that HS indels are potentially positively selected in about 10 Mb of the human genome. Furthermore, the indel-associated positively selected regions overlap with genes more often than expected. However, our result suggests that the potential targets of positive selection are located in noncoding regions. Meanwhile, we also demonstrate that the genomic regions surrounding HS indels are more frequently involved in recent selective sweep than the other regions. In addition, HS indels are associated with distinct recent selective sweep events in different human subpopulations. Our results suggest that HS indels may have been associated with human adaptive changes at both the species level and the subpopulation level

    The effects of multiple features of alternatively spliced exons on the K(A)/K(S )ratio test

    Get PDF
    BACKGROUND: The evolution of alternatively spliced exons (ASEs) is of primary interest because these exons are suggested to be a major source of functional diversity of proteins. Many exon features have been suggested to affect the evolution of ASEs. However, previous studies have relied on the K(A)/K(S )ratio test without taking into consideration information sufficiency (i.e., exon length > 75 bp, cross-species divergence > 5%) of the studied exons, leading to potentially biased interpretations. Furthermore, which exon feature dominates the results of the K(A)/K(S )ratio test and whether multiple exon features have additive effects have remained unexplored. RESULTS: In this study, we collect two different datasets for analysis – the ASE dataset (which includes lineage-specific ASEs and conserved ASEs) and the ACE dataset (which includes only conserved ASEs). We first show that information sufficiency can significantly affect the interpretation of relationship between exons features and the K(A)/K(S )ratio test results. After discarding exons with insufficient information, we use a Boolean method to analyze the relationship between test results and four exon features (namely length, protein domain overlapping, inclusion level, and exonic splicing enhancer (ESE) frequency) for the ASE dataset. We demonstrate that length and protein domain overlapping are dominant factors, and they have similar impacts on test results of ASEs. In addition, despite the weak impacts of inclusion level and ESE motif frequency when considered individually, combination of these two factors still have minor additive effects on test results. However, the ACE dataset shows a slightly different result in that inclusion level has a marginally significant effect on test results. Lineage-specific ASEs may have contributed to the difference. Overall, in both ASEs and ACEs, protein domain overlapping is the most dominant exon feature while ESE frequency is the weakest one in affecting test results. CONCLUSION: The proposed method can easily find additive effects of individual or multiple factors on the K(A)/K(S )ratio test results of exons. Therefore, the system can analyze complex conditions in evolution where multiple features are involved. More factors can also be added into the system to extend the scope of evolutionary analysis of exons. In addition, our method may be useful when orthologous exons can not be found for the K(A)/K(S )ratio test

    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Get PDF
    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.Peer reviewe

    NCLcomparator: systematically post-screening non-co-linear transcripts (circular, trans-spliced, or fusion RNAs) identified from various detectors

    No full text
    Abstract Background Non-co-linear (NCL) transcripts consist of exonic sequences that are topologically inconsistent with the reference genome in an intragenic fashion (circular or intragenic trans-spliced RNAs) or in an intergenic fashion (fusion or intergenic trans-spliced RNAs). On the basis of RNA-seq data, numerous NCL event detectors have been developed and detected thousands of NCL events in diverse species. However, there are great discrepancies in the identification results among detectors, indicating a considerable proportion of false positives in the detected NCL events. Although several helpful guidelines for evaluating the performance of NCL event detectors have been provided, a systematic guideline for measurement of NCL events identified by existing tools has not been available. Results We develop a software, NCLcomparator, for systematically post-screening the intragenic or intergenic NCL events identified by various NCL detectors. NCLcomparator first examine whether the input NCL events are potentially false positives derived from ambiguous alignments (i.e., the NCL events have an alternative co-linear explanation or multiple matches against the reference genome). To evaluate the reliability of the identified NCL events, we define the NCL score (NCL score ) based on the variation in the number of supporting NCL junction reads identified by the tools examined. Of the input NCL events, we show that the ambiguous alignment-derived events have relatively lower NCL score values than the other events, indicating that an NCL event with a higher NCL score has a higher level of reliability. To help selecting highly expressed NCL events, NCLcomparator also provides a series of useful measurements such as the expression levels of the detected NCL events and their corresponding host genes and the junction usage of the co-linear splice junctions at both NCL donor and acceptor sites. Conclusion NCLcomparator provides useful guidelines, with the input of identified NCL events from various detectors and the corresponding paired-end RNA-seq data only, to help users selecting potentially high-confidence NCL events for further functional investigation. The software thus helps to facilitate future studies into NCL events, shedding light on the fundamental biology of this important but understudied class of transcripts. NCLcomparator is freely accessible at https://github.com/TreesLab/NCLcomparator
    corecore